arXiv:1506.05709vl [stat.AP] 18Jun2015 


Bayesian Covariance Modelling of Large 
Tensor-Variate Data Sets & Inverse 
Non-parametric Learning of the Unknown 
Model Parameter Vector 

Kangrui Wang^j^ Dalia Chakrabarty^j^ 

^Department of Mathematics, University of Leicester 

Abstract Tensor-valued data are being encountered increasingly more commonly, in the biologi¬ 
cal, natural as well as the social sciences. The learning of the unknown model parameter vector 
given such data, involves covariance modelling of such data, though this can be difficult owing to the 
high-dimensional nature of the data, where the numerical challenge of such modelling can only be 
compounded by the largeness of the available data set. Assuming such data to be modelled using a 
correspondingly high-dimensional Gaussian Process {GV), the joint density of a finite set of such data 
sets is then a tensor normal distribution, with density parametrised by a mean tensor M (that is of 
the same dimensionality as the fc-tensor valued observable), and the k covariance matrices Si,..., S^. 
When aiming to model the covariance structure of the data, we need to estimate/learn {Si...Sfc} and 
M, given tha data. We present a new method in which we perform such covariance modelling by 
first expressing the probability density of the available data sets as tensor-normal. We then invoke 
appropriate priors on these unknown parameters and express the posterior of the unknowns given the 
data. We sample from this posterior using an appropriate variant of Metropolis Hastings. Since the 
classical MCMC is time and resource intensive in high-dimensional state spaces, we use an efficient 
variant of the Metropolis-Hastings algorithm-Transformation based MCMC-employed to perform ef¬ 
ficient sampling from a high-dimensional state space. Once we perform the covariance modelling of 
such a data set, we will learn the unknown model parameter vector at which a measured (or test) data 
set has been obtained, given the already modelled data (training data), augmented by the test data. 
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1 Introduction 

Let the causal relationship between observable V and model parameter S be defined as 
V = where V is tensor-variate: V G ]]jrnixm2...xmfc _ want to estimate value 
of S at which test data D(*®®*)-i.e. measured value(s) of V-is (are) realised. To do this, we 
need to learn function ^(b, which in this case is a tensor-variate function-of the model pa¬ 
rameter vector S G In the presence of training data D, such supervised learning can be 
possible by fitting known parametric forms (such as splines/wavelets) to the training data 
to learn the form of ^(■), which can thereafter be inverted and operated upon the test data 
to yield Here, training data D is this set of n values of V, each generated at a design 

point, i.e. a chosen value s^*'> of S. Thus, D ;= {(ui, ..., (u„, However, fitting 

with splines/wavelets is inadequate in that it does not capture the correlations between 
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the components of a high-dimensional function; also, the computational complications of 
such fitting-and particularly of inversion of the learnt ^(■)-increases rapidly with increase 
in dimensionality. Thus, we resort to the modelling of this high-dimensional data using a 
correspondingly high-dimensional Gaussian Process (GP), i.e. a tensor-variate GP. 

2 Method 

Thus, the joint probability distribution of a set of n realisations of the k — 1-variate V is 
a A:-variate normal distribution with mean M and k covariance matrices: 

p{V\M, Si,..., oc exp(-||(y - M) Xi A^K.. x, A-if/2) (1) 

where the covariance matrix Sp = ApAj,p = 1,..., k. Tensor-variate normal distribution 
is extensively discussed in the literature, (Xu, Yan & Yuan [3] ; Hoff [T] ) 

Equationimplies that the likelihood of n values (ui, U 2 ,..., u„) of V given the 
unknown tensor-variate parameters of the GP is /c-tensor variate normal. So, we write 
this likelihood and thereafter the posterior probability density of these unknown tensor- 
variate parameters given the training data (subsequent to the invoking of the priors on 
each unknown). Once this is achieved, we then sample from the posterior using an MCMC 
technique to achieve marginal density distributions of each uknown. The learning of 
could be undertaken by writing the posterior predictive distribution of S given the test data 
^(test) ^ and given the tensor-variate parameters learnt using the training data. However, 
we decide to write the joint posterior probability density of and all the other tensor- 

variate parameters given training-ftest data, and sample from this density to obtain the 
marginals of all the unknowns. 

The first step is to write the likelihood of D given the tensor-variate mean and co- 
variance matrices of the GP. Here the mean matrix is M G 

possible to estimate the mean as a function of s and be removed from the non-zero mean 
model. Under these circumstance, a general method of estimation, like maximum likeli¬ 
hood estimation or least square estimation, can be used. Then, the Gaussian Process can 
be converted into a zero mean GP. However, if necessary, the mean tensor itself can be re¬ 
garded as a random variable and learnt from the data [3] . The modelling of the covariance 
structure of this GP is discussed in the following subsection. 

2.1 Covariance structure 

In this context, it is relevant that a A:-dimensional random tensor S £ j^mixm 2 ...xmk 
be decomposed to a unit random k dimensional tensor {Z) and k number of covariance 
matrix by Tucker product [1]: 

S = .Z Xi Si X2 S2... Xj, Sfc ( 2 ) 

where the p-th covariance matrix is nip x nip matrix and nip £ Z>o, 1 ^ 2 ^ ■ ■ ■ 1 ™fc} 

for the tensor S that is mi x m 2 x ... x mfe-dimensional . 

We choose to model the covariance structure of the GP with a Squared Exponential 
(SQE) covariance function. The implementation of this can be expressed in different ways, 
but in this initial phase of the project, we perform parametrisation of the covariance 
structure using the Tucker Product that has been extensively studied[T]. It is recalled that 
the SQE form can be expressed as 

k 

p(U|M,Si,...,Sfe) = (27r)-’"/2(J]|S,|-'"^''"-)xexp(-||(U-M)xiAr'x2A^\..XfcA-i72) 

2 = 1 


( 3 ) 


where m = fliLi Sp = ApA^. 

Although this probability density function is well structured and can in principle be 
used to model high dimensional data, the computational complicity increases with large 
and/or high-dimensional data sets. If we do not implement a particular parametric model 
for the covariance kernels but aim to learn each element of each covariance matrix, the total 

k 

number of parameters in the covariance structure to be then learnt, ends up as This 

p=i 

could be a big number for a large data set and the computational demand on such learning 
can be formidable. Also, the computational task of inverting the covariance matrix Sp is 
in itself highly resource intensive, with the demand on time and computational resources 
increasing with the dimensions of Sp, p = 1 ,..., /c. 

3 Application 

We perform an empirical illustration of our method, to hrst learn the covariance structure 
of a large astronomical training data set, and thereafter, employ such learning towards 
the prediction of the value of the unknown model parameter at which the test data is 
realised. The training data comprises has 216 observations, where an observation consti¬ 
tutes a sequence of 2-dimensional vectors. In fact, each such 2-dimensional vector is a 
2-dimensional velocity vector of a star that is a neighbour of the Sun, as tracked within 
an astronomical simulation [2] of the disk of our Galaxy. There are 50 stars tracked at 
each design point i.e. at each assigned value of the unknown model parameter vector, that 
is in this application is the location of the Sun in the two-dimensional, (by assumption). 
Milky Way disk. In other words, S itself is a 2-dimensional vector. There are 216 design 
points used to generate this (simulated) training data that then constitutes 216 number 
of 50 X 2-dimensional velocity matrices, with each velocity matrix generated at each of the 
216 design points in this training data. Thus, the training data D in this application is 
216 X 50 X 2-dimensional 3-tensor 

To reduce the difficulty of MCMC algorithm, the mean tensor is estimated by the 
maximum likelihood estimation. 

When building the covariance structure of this training data set, the likelihood of 
which is now 3-tensor-normal, we consider three covariance matrices. Of these, the 216 x 
216-dimensional covariance matrix Si bears information about the correlation between 
velocity matrices generated at the 216 different values of S, i.e. at the 216 different solar 
locations in the Milky Way disk. The 50 x 50 covariance matrix S 2 illustrates the corre¬ 
lation between any pair of the 50 stars at a given s, that are tracked in the astronomical 
simulation and the last covariance matrix S 3 represents the correlation between the 2 
components of the velocity vector of a star that is tracked at a given s for its velocty in 
the astronomical simulation. If we learn the elements of each covarianc matrix directly, we 
will have 216 x 216 -f 50 x 50 -f 2 x 2 number of parameters to learn, which is too many 
given limits of time and computational resourse. Thus, we model the covariance kernels 
using known forms, the parameters of which we then learn from the data. 

In particular, we use the Squared Exponential (SQE) covariance function to model 
the 216 X 216 matrix Si and learn the correlation lengths-or rather their reciprocals, 
the smoothing parameters-using the training data. As the 216 velocity matrices are each 
generated at a respective value of S, Si can be written as Si = [a^] where i,j = 1 ,..., 216 
with 

— {si — Sj) Qi [si — Sj) , 


ttij = exp 




where Qi is a, dx d square diagonal matrix, with S £ As d = 2 in our application, we 
learn 2 smoothness parameters. 

The covariance matrix S 2 quantifies correlation amongst the different stellar velocity 
vectors generated at a given s. The 50 stellar velocity vectors that are recorded at a given 
s are chosen over other values of stellar velocity vectors. Given that the velocity vector of 
each star is 2 -dimensional, we again learn 2 smoothness parameters (diagonal elements of 
matrix Q 2 )) using an SQE model. 

In addition we learn the 4 parameters of the covariance matrix S 3 . 

Thus, we will have 8 parameters of covariance 

structure to learn from the data, where these parametersare defined as in: 


Qi 



ol) 


; Q2 — 




In the initial phase of the project that is currently underway, we write the joint posterior 
probability density of the unknown parameters and sample from it using a variant of the 
metropolis-Hastings algorithm, referred to as Transformation-based MCMC (TMCMC). 
To write the posterior, we impose uniform priors on each of our unknowns. 


Table 1: Priors for parameters 
Parameters Prior 



Uniform 

« 1 

<1S 

Uniform 

^('?22') « 1 

yii 

Uniform 

) « 1 

( 2 ) 

^22 

Uniform 

« 1 

S 3 

Non-informative 

7r(S3) OC |S3|“ 


The proposal density that we use in our MCMC scheme, to generate updates for each 
of our parameters is tabulated within the section in which TMCMC is described. The 
results of our learning and estimation of the mean and covariance structure of the CP 
used to model this tensor-variate data, is discussed below in Sectionj^ Once this phase of 
the work is over, we will proceed to include the test and training data both, to write the 
joint posterior probability density of and the 8 unknowns in Qi,Q 2 ,^ 3 , and learn 

all these parameters. 

4 Transformation based MCMC 

We are using the Transformation based MCMC algorithm to estimating the parameters. 
Although the TMCMC method will lose some of the information, the method is efficient 
in high dimensional distributions. 

• l.Set initial value Sq, qQ^\ ..., , counter n = 1 and a forward probability poi ■ ■ ■ ,Pk 

• 2.Generate e ~ Gamma{l, 1) and u ^U{0, 1) independently. 

• 3.If u < po, let s' = s„_i -h /3oe. Else, let s' = 5„_i — Pqc 

• 4.Repeat step 2 and step 3 for gj,..., 

• 5.Calculate the acceptance rate: 

_ Y{i(iDV^ X njgDc(l -Pj) ^ posterior {s', q[, ...g^,) 

niGD(f ~ Pi) ^ IljeD<=Pj posterior{Sn-i,qnli^ ■■■Qnli) 








where, set D is the elements which has the backward transform(u > p) and set is 
the elements which has the forward transform(u <= p). 

• 6.Accept s' ,q[,... ,q'f. as s„, qn \ q^'^ with probability a or drop s', q'^,...,q'f. with 

probability 1 — a 

• 7.Repeat 2 to 6 until the chain get convergence. 

5 Results 
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Figure 1: trace of the likelihood generated by Figure 2: marginal probability density for q[^i 
TMCMC 



Figure 3: marginal probability density for Figure 4: marginal probability density for 
cr[^ cr[^ in full line and cr^i in broken line 

In the top left panel of Figure we present the trace of the likelihood of the training 
data given the 8 unknowns in Qi,Q 2 ,^ 3 , with 2 x 10^ of iterations. The stationarity of 
the trace betrays the achievement of convergence of the chain. 

The marginal posterior probability densities of each unknown parameter is also learnt 
using TMCMC. The same for parameters (Figurej^, an (Figure]^ and a^ (Figure 






























I^are shown in the top right and bottom left and right panels. As noticed in the inequality 
of the marginals of the non-diagonal elements of S 3 shown in the bottom panels of this 
figure, the covariance structure for this astronomical data set does not appear to adhere 
to stationarity. Had the covariance been stationary, the 1,2-th and 2,1-th elements would 
be equal, i.e. their marginals would coincide. But such is not the case as evident from 
comparing the two density in figure]^ which shows a drift from (T 12 to cr 2 i. This further 
suggests that our modelling of the S 2 matrix using SQE covariance function is pre-matured. 
We are exploring the implementation of non-stationary covariance modelling of s. 
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